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Abstract 

A central problem in data analysis is the low dimensional representation of high di¬ 
mensional data, and the concise description of its underlying geometry and density. In 
the analysis of large scale simulations of complex dynamical systems, where the notion 
of time evolution comes into play, important problems are the identification of slow 
variables and dynamically meaningful reaction coordinates that capture the long time 
evolution of the system. In this paper we provide a unifying view of these apparently 
different tasks, by considering a family of diffusion maps , defined as the embedding 
of complex (high dimensional) data onto a low dimensional Euclidian space, via the 
eigenvectors of suitably defined random walks defined on the given datasets. Assuming 
that the data is randomly sampled from an underlying general probability distribution 
p(x) = e~ utyX \ we show that as the number of samples goes to infinity, the eigenvectors 
of each diffusion map converge to the eigenfunctions of a corresponding differential op¬ 
erator defined on the support of the probability distribution. Different normalizations 
of the Markov chain on the graph lead to different limiting differential operators. For 
example, the normalized graph Laplacian leads to a backward Fokker-Planck operator 
with an underlying potential of 2U(x), best suited for spectral clustering. A specific 
anisotropic normalization of the random walk leads to the backward Fokker-Planck op¬ 
erator with the potential U(x), best suited for the analysis of the long time asymptotics 
of high dimensional stochastic systems governed by a stochastic differential equation 
with the same potential U(x). Finally, yet another normalization leads to the eigen¬ 
functions of the Laplace-Beltrami (heat) operator on the manifold in which the data 
resides, best suited for the analysis of the geometry of the dataset, regardless of its 
possibly non-uniform density. 


1 Introduction 

Analysis of complex high dimensional data is an exploding area of research, with applica¬ 
tions in diverse fields, such as machine learning, statistical data analysis, bio-informatics, 
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meteorology, chemistry and physics. In the first three application fields, the underlying as¬ 
sumption is that the data is sampled from some unknown probability distribution, typically 
without any notion of time or correlation between consecutive samples. Important tasks are 
dimensionality reduction, e.g., the representation of the high dimensional data with only a 
few coordinates, and the study of the geometry and statistics of the data, its possible decom¬ 
position into clusters, etc [TjJ. In addition, there are many problems concerning supervised 
learning, in which additional information, such as a discrete class g(x) € {gi ,..., g^} or a 
continuous function value f(x) is given to some of the data points. In this paper we are con¬ 
cerned only with the unsupervised case, although some of the methods and ideas presented 
can be applied to the supervised or semi-supervised case as well |2j. 

In the later three above-mentioned application fields the data is typically sampled from 
a complex biological, chemical or physical dynamical system, in which there is an inherent 
notion of time. Many of these systems involve multiple time and length scales, and in 
many interesting cases there is a separation of time scales, that is, there are only a few 
’’slow” time scales at which the system performs conformational changes from one meta¬ 
stable state to another, with many additional fast time scales at which the system performs 
local fluctuations within these meta-stable states. In the case of macromolecules the slow 
time scale is that of a conformational change, while the fast time scales are governed by the 
chaotic rotations and vibrations of the individual chemical bonds between the different atoms 
of the molecule, as well as the random fluctuations due to the frequent collisions with the 
surrounding solvent water molecules. In the more general case of interacting particle systems, 
the fast time scales are those of density fluctuations around the mean density profiles, while 
the slow time scales correspond to the time evolution of these mean density profiles. 

Although on the fine time and length scales the full description of such systems requires a 
high dimensional space, e.g. the locations (and velocities) of all the different particles, these 
systems typically have an intrinsic low dimensionality on coarser length and time scales. 
Thus, the coarse time evolution of the high dimensional system can be described by only a 
few dynamically relevant variables, typically called reaction coordinates. Important tasks in 
such systems are the reduction of the dimensionality at these coarser scales (known as homog¬ 
enization), and the efficient representation of the complicated linear or non-linear operators 
that govern their (coarse grained) time evolution. Additional goals are the identification of 
the meta-stable states, the characterization of the transitions between them and the efficient 
computation of mean exit times, potentials of mean force and effective diffusion coefficients 

pel 3 (i. 

In this paper, following [7j, we consider a family of diffusion maps for the analysis of 
these problems. Given a large dataset, we construct a family of random walk processes 
based on isotropic and anisotropic diffusion kernels and study their first few eigenvalues and 
eigenvectors (principal components). The key point in our analysis is that these eigenvectors 
and eigenvalues capture important geometrical and statistical information regarding the 
structure of the underlying datasets. 

It is interesting to note that similar approaches have been suggested in various different 
fields. In graph theory, the first few eigenvectors of the normalized graph Laplacian have been 
used for spectral clustering Pi, approximations to the optimal normalized-cut problem in 
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and dimensionality reduction [mini, to name just a few. Similar constructions have also 
been used for the clustering and identification of meta-stable states for datasets sampled 
from dynamical systems (3j. However, it seems that the connection of these computed 
eigenvectors to the underlying geometry and probability density of the dataset has not been 
fully explored. 

In this paper, we consider the connection of these eigenvalues and eigenvectors to the 
underlying geometry and probability density distribution of the dataset. To this end, we 
assume that the data is sampled from some (unknown) probability distribution, and view 
the eigenvectors computed on the finite dataset as discrete approximations of corresponding 
eigenfunctions of suitably defined continuum operators in an infinite population setting. As 
the number of samples goes to infinity, the discrete random walk on the set converges to 
a diffusion process defined on the n-dimensional space but with a non-uniform probability 
density. By explicitly studying the asymptotic form of the Chapman-Kolmogorov equations 
in this setting (e.g., the infinitesimal generators), we find that for data sampled from a general 
probability distribution, written in Boltzmann form as p(x) = e~ u( - x \ the eigenfunctions 
and eigenvalues of the standard normalized graph Laplacian construction correspond to a 
diffusion process with a potential 2 U(x) (instead of a single U(x)). Therefore, a subset of 
the first few eigenfunctions are indeed well suited for spectral clustering of data that contains 
only a few well separated clusters, corresponding to deep wells in the potential U(x). 

Motivated by the well known connection between diffusion processes and Schrodinger 
operators ra, we propose a different novel non-isotropic construction of a random walk 
on the graph, that in the asymptotic limit of infinite data recovers the eigenvalues and 
eigenfunctions of a diffusion process with the same potential U(x). This normalization, 
therefore, is most suited for the study of the long time behavior of complex dynamical 
systems that evolve in time according to a stochastic differential equation. For example, 
in the case of a dynamical system driven by a bistable potential with two wells, (e.g. with 
one slow time scale for the transition between the wells and many fast time scales) the 
second eigenfunction can serve as a parametrization of the reaction coordinate between the 
two states, much in analogy to its use for the construction of an approximation to the 
optimal normalized cut for graph segmentation. For the analysis of dynamical systems, we 
also propose to use a subset of the first few eigenfunctions as reaction coordinates for the 
design of fast simulations. The main idea is that once a parametrization of dynamically 
meaningful reaction coordinates is known, and lifting and projection operators between the 
original space and the diffusion map are available, detailed simulations can be initialized 
at different locations on the reaction path and run only for short times, to estimate the 
transition probabilities to different nearby locations in the reaction coordinate space, thus 
efficiently constructing a potential of mean held and an efficient diffusion coefficient on the 
reaction path T%| . 

Finally, we describe yet another random walk construction that in the limit of infinite data 
recovers the Laplace-Bcltrami (heat) operator on the manifold on which the data resides, re¬ 
gardless of the possibly non-uniform sampling of points on the manifold. This normalization 
is therefore best suited for learning the geometry of the dataset, as it separates the geometry 
of the manifold from the statistics on it. 
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Our analysis thus reveals the intimate connection between the eigenvalues and eigen¬ 
functions of different random walks on the finite graph to the underlying geometry and 
probability distribution p = e~ u from which the dataset was sampled. These findings lead 
to a better understanding of the advantages and limitations of diffusion maps as a tool to 
solve different tasks in the analysis of high dimensional data. 


2 Problem Setup 

Consider a finite dataset e M ra . We consider two different possible scenarios for the 

origin of the data. In the first scenario, the data is not necessarily derived from a dynamical 
system, but rather it is randomly sampled from some arbitrary probability distribution p(x). 
In this case we define an associated potential 

U(x) — — logp(cc) (1) 


so that p — e u . 

In the second scenario, we assume that the data is sampled from a dynamical system in 
equilibrium. We further assume that the dynamical system, defined by its state x(t) e M" 
at time t, satisfies the following generic stochastic differential equation (SDE) 

x = —\7U(x) + V2w (2) 


where a dot on a variable means differentiation with respect to time, U is the free energy at 
x (which, with some abuse of nomenclature, we will also call the potential at x), and w(t) is 
an n-dimensional Brownian motion process. In this case there is an explicit notion of time, 
and the transition probability density p(x, t\y , s) of finding the system at location x at time 
t, given an initial location y at time s (t > s), satisfies the forward Fokker-Planck equation 
(FPE) 

^ = V-(Vp + pVU(x)) (3) 


with initial condition 


lim pfx, t\y, s ) = 5(x — y) 

t^s+ 


(4) 


Similarly, the backward Fokker-Planck equation for the density p(x, t\y , s), in the backward 
variables y,s (s < t) is 

- ^ = A p-Vp- VU(y) (5) 

os 

where differentiations in © are with respect to the variable y, and the Laplacian A is a 
negative operator, defined as A u = V • (Vu). 

As time t —► oo the steady state solution of © is given by the equilibrium Boltzmann 
probability density, 

p(x)dx = Pr {x}dx = ^tEMl dx (6) 

Zj 
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where Z is a normalization constant (known as the partition function in statistical physics), 
given by 

Z — f exp(—U(x))dx (7) 

J R™ 

In what follows we assume that the potential U(x) is shifted by the suitable constant (which 
does not change the SDE 0), so that Z = 1. Also, we use the notation y(x) = Pr{*} = 
p(x) = interchangeably to denote the (invariant) probability measure on the space. 

Note that in both scenarios, the steady state probability density, given by © is identical. 
Therefore, for the purpose of our initial analysis, which does not directly take into account 
the possible time dependence of the data, it is only the features of the underlying potential 
U(x) that come into play. 

The Langevin equation 0 or the corresponding Fokker-Planck equation © are com¬ 
monly used to describe mechanical, physical, chemical, or biological systems driven by noise. 
The study of their behavior, and specifically the decay to equilibrium has been the subject 
of much theoretical research DU- In general, the solution of the Fokker-Planck equation m 
can be written in terms of an eigenfunction expansion 

OO 

p( x ,t) = X ' , p j (x) (8) 

3=0 

where —A j are the eigenvalues of the FP operator, with Ao = 0 < Ai < A 2 < ..., (pj(x) are 
their corresponding eigenfunctions, and the coefficients cq depend on the initial conditions. 
Obviously, the long term behavior of the system is governed only by the Erst few eigenfunc¬ 
tions ... ,(pk, where k is typically small and depends on the time scale of interest. 

In low dimensions, e.g. n < 3 for example, it is possible to calculate approximations to 
these eigenfunctions via numerical solutions of the relevant partial differential equations. In 
high dimensions, however, this approach is in general infeasible and one typically resorts to 
simulations of trajectories of the corresponding SDE 0. In this case, there is a need to 
employ statistical methods to analyze the simulated trajectories, identify the slow variables, 
the meta-stable states, the reaction pathways connecting them and the mean transition times 
between them. 


3 Diffusion Maps 

3.1 Finite Data 

Let {;denote the N samples, either merged from many different simulations of the 
stochastic equation 0, or simply given without an underlying dynamical system. In Hj, 
Coifman and Lafon suggested the following method, based on the definition of a Markov 
chain on the data, for the analysis of the geometry of general datasets: 

For a fixed value of e (a metaparameter of the algorithm), define an isotropic diffusion 
kernel, 

k E (x, y ) = exp ( 9 ) 
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Assume that the transition probability between points x t and x 3 is proportional to k £ (x t , Xj), 
and construct aniVxiV Markov matrix, as follows 


M(i,j) 


ke (Xii Xj ) 

Pe(Xj) 


where p £ is the required normalization constant, given by 


Pe(Xj) = ^ k £ (Xi, Xj) 
i 


( 10 ) 


( 11 ) 


For large enough values of e the Markov matrix M is fully connected (in the numerical sense) 
and therefore has an eigenvalue Ao = 1 with multiplicity one and a sequence of additional 
n — 1 non-increasing eigenvalues Xj < 1, with corresponding eigenvectors (fj. 

The diffusion map at time m is defined as the mapping from x to the vector 


= (A£Vo(a:), \™ipi(x ),..., \%ip k (x)) 

for some small value of k. In [7], it was demonstrated that this mapping gives a low di¬ 
mensional parametrization of the geometry and density of the data. In the field of data 
analysis, this construction is known as the normalized graph Laplacian. In ra. Shi and 
Malik suggested using the first non-trivial eigenvector to compute an approximation to the 
optimal normalized cut of a graph, while the first few eigenvectors were suggested by Weiss 
et al. PfHl for clustering. Similar constructions, falling under the general term of kernel 
methods have been used in the machine learning community for classification and regression 
• In this paper we elucidate the connection between this construction and the underlying 
potential U(x). 


3.2 The Continuum Diffusion Process 

To analyze the eigenvalues and eigenvectors of the normalized graph Laplacian, we consider 
them as a finite approximation of a suitably defined diffusion operator acting on the con¬ 
tinuum probability space from which the data was sampled. We thus consider the limit of 
the above Markov chain process as the number of samples approaches infinity. For a finite 
value of e, the Markov chain in discrete time and space converges to a Markov process in 
discrete time but continuous space. Then, in the limit e —> 0, this jump process converges 
to a diffusion process on M n , whose local transition probability depends on the non-uniform 
probability measure y(x) = e~ u ^ x k 

We first consider the case of a fixed e > 0, and take N —> oo. Using the similarity of © 
to the diffusion kernel, we view e as a measure of time and consider a discrete jump process 
at time intervals At = s, with a transition probability between points y and x proportional 
to k e (x,y). However, since the density of points is not uniform but rather given by the 
measure y(x), we define an associated normalization factor p e (y) as follows, 

Ps(y) = / k £ (x,y)y(x)dx (12) 
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and a forward transition probability 


(13) 


M f (x\y) = Pr (x(t + e) = x\x(t) = y) = 

Pe{y) 

Equations m and (HI are the continuous analogues of the discrete equations (HU) and m- 
For future use, we also define a symmetric kernel M s (x,y) as follows, 


M s (x,y) 


yj.Ps(x)p e (y) 


(14) 


Note that p e {x) is an estimate of the local probability density at x, computed by averaging 
the density in a neighborhood of radius 0(e 1//2 ) around x. Indeed, as e —> 0, we have that 

p £ (x) = p{x) + |a p(x) + 0(e 3/2 ) (15) 


We now define forward, backward and symmetric Chapman-Kolmogorov operators on 
functions defined on this probability space, as follows, 


T f[p\( x ) = 

f M f (x\y)<p(y)dy(y) 

(16) 

T b [ip\(x) = ^ 

[ M f {y\x)(p{y)dy(y) 

(17) 

T s [(p\(x) = 

[ M a (x,y)ip(y)dy(y) 

(18) 


If p(x) is the probability of finding the system at location x at time t — 0, then Tf[ip\ is the 
evolution of this probability to time t — e. Similarly, if ip{z) is some function on the space, 
then Tb[i(j\(x) is the mean (average) value of that function at time £ for a random walk that 
started at x, and so Tl n [i/j](x) is the average value of the function at time t = me. 

By definition, the operators Tj and are adjoint under the inner product with weight 
/i, while the operator T s is self adjoint under this inner product, 

(Tf<p, V’)/, = {<P, T^)^, (T s ip, ip)^ = ((p, T s ^) m (19) 

Moreover, since T s is obtained via conjugation of the kernel Mf with a Jp e {x) all three 
operators share the same eigenvalues. The corresponding eigenfunctions can be found via 
conjugation by yFor example, if T s ip s = A tp s , then the corresponding eigenfunctions for 
Tf and Tb are <pf = \fplPs and ipb = Ps/y/Pe, respectively. Since is the first eigenfunction 
with A 0 = 1 of T S1 the steady state of the forward operator is simply p e (x), while for the 
backward operator it is the uniform density, <pb = 1. 

Obviously, the eigenvalues and eigenvectors of the discrete Markov chain described in the 
previous section are discrete approximations to the eigenvalues and eigenfunctions of these 
continuum operators. Rigorous mathematical proofs of this convergence as N —> oo under 
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various assumptions have been recently obtained by several authors [221 22]. Therefore, for 
a better understanding of the finite sample case, we are interested in the properties of the 
eigenvalues and eigenfunctions of either one of the operators Tf,T h or T s , and how these 
relate to the measure p(x) and to the corresponding potential U(x). To this end, we look 
for functions ip{x) such that 


T jV 


Mj(x, y)<p(y) Pr {y}dy = A <p(x) 


( 20 ) 


where j G {/, b, s}. 

While in the case of a finite amount of data, e must remain finite so as to have enough 
neighbors in a ball of radius 0(e 1//2 ) near each point x, as the number of samples goes to 
infinity, we can take smaller and smaller values of e. Therefore, it is instructive to look at the 
limit e —> 0. In this case, the transition probability densities of the continuous in space but 
discrete in time Markov chain converge to those of a diffusion process, whose time evolution 
is described by a differential equation 


dip 

~dt 


= H f ip 


where Hf is the infinitesimal generator or propagator of the forward operator, defined as 


H f = lim -—^ 

As shown in the Appendix, by computing the asymptotic expansion of the corresponding 
integrals in the limit £ —> 0, we obtain that 


Hf<p = Ap — <p (e u Ae u ) 


( 21 ) 


Similarly, the inifinitesimal operator of the backward operator is given by 

T — I 

H b i\) = lim —- il> = Aip- 2S7ip ■ VU (22) 

£—^0 £ 

As expected, Vb = 1 is the eigenfunction with Ao = 0 of the backward infinitesimal operator, 
while (fo = eT u is the eigenfunction of the forward one. 

A few important remarks are due at this point. First, note that the backward operator 
o has the same functional form as the backward FPE Q. but with a potential 2U(x) 
instead of U(x). The forward operator (12 1 1 ) has a different functional form from the forward 
FPE © corresponding to the stochastic differential equation ((21). This should come as no 
surprise, since m is the differential operator of an isotropic diffusion process on a space 
with non-uniform probability measure /j.(x), which is obviously different from the standard 
anisotropic diffusion in a space with a uniform measure, as described by the SDE © m- 

Interestingly, however, the form of the forward operator is the same as the Schrodinger 
operator of quantum physics la e.g. 


Hip = A p — ipV (x) 


(23) 




where in our case the potential V(x) has the following specific form 

V(x) = (W(a0) 2 - AU(x). (24) 

Therefore, in the limit N —> oo,e —>■ 0, the eigenfunctions of the diffusion map are the same 
as those of the Schrodinger operator (ESI) with a potential El- The properties of the first 
few of these eigenfunctions have been extensively studied theoretically for simple potentials 

v(x) mg. 

In order to see why the forward operator TL / also corresponds to a potential 2 U(x) instead 
of U(x), we recall that there is a well known correspondence T^J, between the Schrodinger 
equation with a sypersymmetric potential of the specific form El and a diffusion process 
described by a Fokker-Planck equation of the standard form ©• The transformation 

p(x, t) — i/>(x, t)e~ u ( x ^ 2 (25) 

applied to the original FPE © yields the Schrodinger equation with imaginary time 

9-0 , fiyuf AU\ 

~m = A ^(—- r) (26) 

Comparing El with El , we conclude that the eigenvalues of the operator m are the same 
as those of a Fokker-Planck equation with a potential 2 U(x). Therefore, in general, for data 
sampled from the SDE s. there is no direct correspondence between the eigenvalues and 
eigenfunctions of the normalized graph Laplacian and those of the corresponding Fokker- 
Planck equation ©. However, when the original potential U(x) has two metastable states 
separated by a large barrier, corresponding to two well separated clusters, so does 2 U(x). 
Therefore, the Erst non-trivial eigenvalue is governed by the mean passage time between 
the two barriers, and the first non-trivial eigenfunction gives a parametrization of the path 
between them (see also the analysis of the next section). 

We note that in Bm, Horn and Gottlieb suggested a clustering algorithm based on the 
Schrodinger operator El, where they constructed an approximate eigenfunction ip(x) = 
p e {x) as in our eq. [TTJ, and computed its corresponding potential V(x) from eq. El- The 
clusters were then defined by the minima of the potential V. Employing the same asymptotic 
analysis of this paper, one can show that in the appropriate limit, the computed potential 
V is given by El- This asymptotic analysis and the connection between the quantum 
operator and a diffusion process thus provides a mathematical explanation for the success 
of their method. Indeed, when U has a deep parabolic minima at a point x, corresponding 
to a well defined cluster, so does V. 


4 Anisotropic Diffusion Maps 

In the previous section we showed that asymptotically, the eigenvalues and eigenfunctions 
of the normalized graph Laplacian operator correspond to the Fokker-Planck equation with 
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a potential 2U(x) instead of the single U(x). In this section we present a different normal¬ 
ization that yields infinitesimal generators corresponding to the potential U(x) without the 
additional factor of two. 

In fact, following [Zj we consider in more generality a whole family of different normal¬ 
izations and their corresponding diffusions, and we show that, in addition to containing 
the graph Laplacian normalization used in the previous section, this collection of diffusions 
includes at least two other Laplacians of interest: the Laplace-Beltrami operator, which cap¬ 
tures the Riemannian geometry of the data set, and the backward Fokker-Planck operator 
of equation ©• 

Instead of applying the graph Laplacian normalization to the isotropic kernel k e (x, y ), we 
first appropriately renormalize the kernel into an anisotropic one to obtain a new weighted 
graph, and then apply the graph Laplacian normalization to this graph. More precisely, 
we proceed as follows: start with a Gaussian kernel k s (x, y) and let a > 0 be a parameter 
indexing our family of diffusions. Define an estimate for the local density as 

Pe{x ) = J k e (x,y)Pv{y}dy 
and consider the family of kernels 


kj“\x,y)= 

Pe{x)Pe{y ) 

We now apply the graph Laplacian normalization by computing the normalization factor 


4 a) (y) = / k^ a) (x,y)Pr{x}d; 


x 


and forming a forward transition probability kernel 


Mi a \x\y) = Pr{a;(f + e) — x\x(t) — y} — 


ke a \x,y) 

di a \y) 


Similar to the analysis of section l3~2l we can construct the corresponding forward, symmetric 
and backward diffusion kernels. It can be shown (see appendix 0 that the forward and 
backward infinitesimal generators of this diffusion are 


n { b a) ^ = A^ - 2(1 - a)'V</> ■ W 

U { fp = Aip - 2aWp ■ WU + (2a - 1)^ ((VI/) 2 - AU) 

We mention three interesting cases: 


(27) 

(28) 


For a = 0, this construction yields the classical normalized graph Laplacian with the 
infinitesimal operator given by equation m 


Hfip = A ip — (e u Ae u ) ip 
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• For a — 1, the backward generator gives the Laplace-Beltrami operator: 

H b *l) = (29) 

In other words, this diffusion captures only the geometry of the data, in which the den¬ 
sity e~ u plays absolutely no role. Therefore, this normalization separates the geometry 
of the underlying manifold from the statistics on it. 

• For a — |, the infinitesimal operator of the forward and backward operators coincide 
and are given by 

7 if(p = 'Hb'-p = A tf — V (p ■ VU (30) 

which is exactly the backward FPE ©, with a potential U{x). 

Therefore, the last case with a = 1/2 provides a consistent method to approximate the 
eigenvalues and eigenfunctions corresponding to the stochastic differential equation © . This 
is done by constructing a graph Laplacian with an appropriately anisotropic weighted graph. 

As explained in the Euclidian distance between any two points after the dif¬ 

fusion map embedding into is almost equal to their diffusion distance on the original 
dataset. Thus, for dynamical systems with only one or two slow time scales, and many fast 
time scales, only a small number of diffusion map coordinates need be retained for the coarse 
grained representation of the data at medium to long times, at which the fast coordinates 
have equilibrated. Therefore, the diffusion map can be considered as an empirical method to 
perform data-driven or equation-free homogenization. In particular, since this observation 
yields a computational method for the approximation of the top eigenfunctions and eigenval¬ 
ues, this method can be applied towards the design of fast and efficient simulations that can 
be initialized on arbitrary points on the diffusion map. This application will be described in 
a separate publication '23J. 


5 Examples 

In this section we present the potential strength of the diffusion map method by analyzing, 
both analytically and numerically a few toy examples, with simple potentials U(x). More 
complicated high dimensional examples of stochastic dynamical systems are analyzed in 
[21] . while other applications such as the analysis of images for which we typically have no 
underlying probability model appear in [7J. 

5.1 Parabolic potential in 1-D 

We start with the simplest case of a parabolic potential in one dimension, which in the context 
of the SDE ©, corresponds to the well known Ornstein-Uhlcnbeck process. We thus consider 
a potential U(x ) = x 2 /2 r, with a corresponding normalized density p = e~ u /\J2 ht. 
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The normalization factor p £ can be computed explicitly 


r e -(x-y) 2 /2 e e -x 2 / 2 r 1 

/ £_f_ Hr = r -y 2 /2(r+e) 

J \/2n£ V2^ ' y/2ir(r + e) 

where, for convenience, we multiplied the kernel k £ {x,y) by a normalization factor 1/x/27re. 
Therefore, the eigenvalue/eigenfunction problem for the symmetric operator T s with a hnite 
ereads 



= 


exp 


{x-yf 

2e 


exp 


x 2 + r/ 2 


. y 

exp 1 -- 


(e + r) 

The first eigenfunction, with eigenvalue Aq = 1 is given by 


T + £ 


T 


<p(y)dy = X<p(x) 


(p 0 (x) = C\J p £ (x) = 
The second eigenfunction, with eigenvalue Ai 

<Pi(x) = xexp 


Cexp (-4(777)) 

= r/(r + e) < 1 is, up 

(-iAh 


to normalization 


In general, the sequence of eigenfunctions and eigenvalues is characterized by the following 
lemma: 

Lemma: The eigenvalues of the operator T s are Afc = (t/(t + e )) k , with the corresponding 
eigenvectors given by 

x)=PtWex p(-4(7T7)) 

where pk is a polynomial of degree k (even or odd depending on k ). 

In the limit £ —> 0, we obtain the eigenfunctions of the corresponding infinitesimal gen¬ 
erator. For the specific potential U(x) = x 2 /2 t, the eigenfunction problem for the backward 
generator reads 

ipxx - 2-i> x = -Xip (32) 

r 

The solutions of this eigenfunction problem are, up to scaling of x, the well known Her- 
rnite polynomials, which by the correspondence of this operator to the Schrodinger eigen¬ 
vector/eigenvalue problem, are also the eigenfunctions of the quantum harmonic oscillator 
(after multiplication by the appropriate Gaussian) jiTHl. 

Note that plotting the second vs. the first eigenfunctions (with the convention that the 
zeroth eigenfunction is the constant one, which we typically ignore), is the same as plotting 
x 2 + 1 vs x, e.g. a parabola. Therefore, we expect that for a large enough and yet hnite 
data-set sampled from this potential, the plot of the corresponding discrete eigenfunctions 
should lay on a parabolic curve (see next section for a numerical example). 
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5.2 Multi-Dimensional Parabolic Potential 


We now consider a harmonic potential in n-dimensions, of the form 

u w = < 33 ) 

3 J 

where, in addition, we assume T\ r 2 > r 3 > ... > r n , so that X\ is a slow variable in the 
context of the SDE ((21). 

We note that for this specific potential, the probability density has a separable structure, 
p(x ) = pi(x\) .. .p n (x n ), and so does the kernel k £ (x,y), and consequently, also the sym¬ 
metric kernel M s (x,y). Therefore, there is an outer-product structure to the eigenvalues 
and eigenfunctions. For example, in two dimensions the eigenfunctions and eigenvalues are 
given by 

Pi,j(x i,x 2 ) =<p li i(x 1 )ip 2 ,j( x 2) and = ( 34 ) 

where /j i = T\/{t\ + e) and /i 2 = t 2 /(t 2 + e). Since by assumption T\ r 2 , then upon 
ordering of the eigenfunctions by decreasing eigenvalue, the first non-trivial eigenfunctions 
are <^i i0 , <^ 2 ,o, • • •, which depend only on the slow variable X\. Note that indeed, regardless of 
the value of e, as long as r 2 > 2ri, we have that > A 2 . Therefore, in this example the first 
few coordinates of the diffusion map give a (redundant) parametrization of the slow variable 
X\ in the system. 

In figure ©we present numerical results corresponding to a 2-dimensional potential with 
T\ = 1, t 2 = 1/25. In the upper left some 3500 points sampled from the distribution p = e~ u 
are shown. In the lower right and left panels, the first two non-trivial backward eigenfunctions 
-01 and 0 2 are plotted vs. the slow variable aq. Note that except at the edges, where the 
statistical sampling is poor, the first eigenfunction is linear in aq, while the second one is 
quadratic in x x . In the upper right panel we plot 0 2 vs. and note that they indeed lie on 
a parabolic curve, as predicted by the analysis of the previous section. 


5.3 A potential with two minima 


We now consider a double well potential U(x) with two minima, one at xl and one at Xr. 
For simplicity of the analysis, we assume a symmetric potential around (xl + Xr)/2, with 
U(xl ) = U(xr) = 0 (see figure©. In the context of data clustering, this can be viewed as 
approximately a mixture of two Gaussian clouds, while in the context of stochastic dynamical 
systems, this potential defines two meta-stable states. 

We first consider an approximation to the quantity p e (x), given by eq. (|12|) . For x near 
XL, U(x) ~ (x — xl) 2 /tl, while for x near xr, U{x) ~ [x — xr) 2 /tr. Therefore, 

g ~ U (y) ~ e ~(y-XL) 2 /2T L _j_ e ~(y-X R ) 2 /2 tr 


and 


p £ (x) « 


1 

7! 


( c ~(x-xl) 2 /2(r L +e) _|_ c ~(x~x R ) 2 /2 (t r +e) 

\ V r L + ^ yJ T R J r£ 
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Figure 1: The anisotropic diffusion map on a harmonic potential in 2-D. 


1 

71 


[<Pl{x) + tp R {x)\ 


(36) 


where (fiL and are the hrst forward eigenfunctions corresponding to a single well potential 
centered at xl or at x R , respectively. As is well known both in the theory of quantum physics 
and in the theory of the Fokker-Planck equation, an approximate expression for the next 
eigenfunction is 


<Pi(x) 


1 

71 


Wl{x) - ip R {x)\ 


Therefore, the hrst non-trivial eigenfunction of the backward operator is given by 


lU VL {x)+ VR (x) 


This eigenfunction is roughly +1 in one well and —1 in the other well, with a sharp transition 
between the two values near the barrier between the two wells. Therefore, this eigenfunction 
is indeed suited for clustering. Moreover, in the context of a mixture of two Gaussian clouds, 
clustering according to the sign of ipi(x) is asymptotically equivalent to the optimal Bayes 
classifier. 

Example: Consider the following potential in two dimensions, 


TT/ . 1 4 25 o 9 2 „_y 2 

U(x, y) = - x 4 - x 3 + -x 2 + 25— 

v 4 12 2 2 


(37) 


In the x direction, this potential has a double well shape with two minima, one at x = 0 and 
one at x — 4, separated by a potential barrier with a maximum at x — 2.25. 

In figure |21 we show some numerical results of the diffusion map on some 1200 points 
sub-sampled from a stochastic simulation with this potential which generated about 40,000 
points. On the upper right panel we see the potential U(x, 0), showing the two wells. In the 
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Figure 2: Numerical results for a double well potential in 2-D. 

upper left, a scatter plot of all the points, color coded according to the value of the local 
estimated density p e , (with e = 0.25) is shown, where the two clusters are easily observed. 
In the lower left panel, the first non-trivial eigenfunction is plotted vs. the first coordinate 
x. Note that even though there is quite a bit of variation in the y-variable inside each of 
the wells, the first eigenfunction ipi is essentially a function of only x, regardless of the value 
of y. In the lower right we plot the first three backward eigenfunctions. Note that they all 
lie on a curve, indicating that the long time asymptotics are governed by the passage time 
between the two wells and not by the local fluctuations inside them. 


5.4 Potential with three wells 


We now consider the following two dimensional potential energy with three wells, 


U(x, y) = 3(3e~ 


v 2 e —(2/—!/3) 2 _ g—(j/—5/3) 2 


5/3e 


-y 


g-(z-i ) 2 _|_ g-^+i ) 2 


(38) 


where f3 = 1/kT is a thermal factor. This potential has two deep wells at (—1,0) and at 
(1,0) and a shallower well at (0,5/3), which we denote as the points L,R,C, respectively, 
The transitions between the wells of this potential have been analyzed in many works |21| . 
In figure |21 we plotted on the left the results of 1400 points sub-sampled from a total of 80000 
points randomly generated from this potential confined to the region [—2.5,2.5] 2 C M 2 at 
temperature (3 = 2, color-coded by their local density. On the right we plotted the first 
two diffusion map coordinates ^(cc), ^ 2 {x). Notice how in the diffusion map space one can 
clearly see a triangle where each vertex corresponds to one of the points L , R, C. This figure 
shows very clearly that there are two possible pathways to go from L to R. A direct (short) 
way and an indirect longer way, that passes through the shallow well centered at C. 
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Figure 3: Numerical results for a triple well potential in 2-D. 



Figure 4: Diffusion map for the iris data set. 


5.5 Iris data set 

We conclude this section with a diffusion map analysis of one of the most popular multivariate 
datasets in pattern recognition, the iris data set. This set contains 3 distinct classes of 
samples in four dimensions, with 50 samples in each class. In figure 0 we see on the left 
the result of the three dimensional diffusion map on this dataset. This picture clearly shows 
that all 50 points of class 1 (blue) are shrunk into a single point in the diffusion map space 
and can thus be easily distinguished from classes two and three (red and green). In the 
right plot we see the results of re-running the diffusion map on the 100 remaining red and 
green samples. The 2-D plot of the first two diffusion maps coordinates shows that there is 
no perfect separation between these two classes. However, clustering according to the sign 
of ij) i(ce) gives misclassihcations rates similar to those of other methods, of the order of 6-8 
samples depending on the value chosen for the kernel width e. 


6 Summary and Discussion 

In this paper, we introduced a mathematical framework for the analysis of diffusion maps, 
via their corresponding infinitesimal generators. Our results show that diffusion maps are a 
natural method for the analysis of the geometry and probability distribution of empirical data 
sets. The identification of the eigenvectors of the Markov chain as discrete approximations 
to the corresponding differential operators provides a mathematical justification for their 
use as a dimensional reduction tool and gives an alternative explanation for their empirical 
success in various data analysis applications, such as spectral clustering and approximations 
of optimal normalized cuts on discrete graphs. 

We generalized the standard construction of the normalized graph Laplacian to a one- 
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parameter family of graph Laplacians that provides a low-dimensional description of the 
data combining the geometry of the set with the probability distribution of the data points. 
The choice of the diffusion map depends on the task at hand. If, for example, data points 
are known to approximately lie on a manifold, and one is solely interested in recovering 
the geometry of this set, then an appropriate normalization of a Gaussian kernel allows to 
approximate the Laplace-Beltrami operator, regardless of the density of the data points. 
This construction achieves a complete separation of the underlying geometry, represented by 
the knowledge of the Laplace operator, from the statistics of the points. This is important in 
situations where the density is meaningless, and yet points on the manifold are not sampled 
uniformly on it. In a different scenario, if the data points are known to be sampled from the 
equilibrium distribution of a Fokker-Planck equation, the long-time dynamics of the density 
of points can be recovered from an appropriately normalized random walk process. In this 
case, there is a subtle interaction between the distribution of the points and the geometry 
of the data set, and one must correctly account for the density of the points. 

While in this paper we analyzed only Gaussian kernels, our asymptotic results are valid 
for general kernels, with the appropriate modification that take into account the mean and 
covariance matrix of the kernel. Note, however, that although asymptotically in the limit 
N —> oo and e —> 0, the choice of the isotropic kernel is unimportant, for a finite data set 
the choice of both e and the kernel can be crucial for the success of the method. 

Finally, in the context of dynamical systems, we showed that diffusion maps with the 
appropriate normalization constitute a powerful tool for the analysis of systems exhibiting 
different time scales. In particular, as shown in the different examples, these time scales can 
be separated and the long time dynamics can be characterized by the top eigenfunctions 
of the diffusion operator. Last, our analysis paves the way for fast simulations of physical 
systems by allowing larger integration steps along slow variable directions. The exact details 
required for the design of fast and efficient simulations based on diffusion maps will be 
described in a separate publication |23J- 

Acknowledgments: The authors would like to thank the referee for helpful suggestions 
and for pointing out ref. EDI- 


A Infinitesimal operators for a family of graph Lapla¬ 
cians 

In this appendix, we present the calculation of the infinitesimal generators for the different 
diffusion maps characterized by a parameter a. 

Suppose that the data set X consists of a Riemannian manifold with a density p(x) = 
e -u(x) an( ] } e t k e (x,y) be a Gaussian kernel. It was shown in [7] that if k e is scaled appro¬ 
priately, then for any function (j) on A", 

/ k £ (x, y)<t>{y)dy = 4>(x) + e(A<j>(x) + q(x)<j>(x)) + 0{e%) 

Jx 
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where q is a function that depends on the Riemannian geometry of the manifold and its 
embedding in M n . Using the notations introduced in section UJ it is easy to verify that 

p £ (x) = p(x) + e(Ap(x) + q(x)p(x)) + 0(e 3//2 ) 


and consequently, 


Let 


= i,-" (1 -atfff-A] fl + 0(£ 3/2 )) 

y) = 

Ps( x )Ps(y) 


Then, the normalization factor is given by 
d { f*\x) = [ k { e a \x,y)p(y)dy = pf a (x)p 1 ~ a (x) 


, \ A p A p 1 

1 + e ( (1 - a)q - a— + ——— 
V P P l ~ a {x ) 


P 


l—a 


Therefore, the asymptotic expansion of the backward operator gives 

rr{a) M f k e a \ x ,y)^f w w ^ , ( A(# 1_ “^ 

H 4 = / ;(a)/ . 4>{y)p{y)dy = H x ) + £ 

J X de (x) 

and its inhnitesimal generator is 

Hb<t> = lim 


A p 


1—a 


p 


l—a 


T h — I 


A (cfp 1 a ) A (p 1 “) 


-,1 —a 


- jl —ol 


e p ^ 

Inserting the expression p = e~ u into the last equation gives 

H b <f> = A0 - 2(1 - a) V0 • VU 
Similarly, the form of the forward inhnitesimal operator is 

Hfi) = Av/> - 2a'Vi/> • VU + (2a - l)ip (VU • VU - AU) 
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